-
Notifications
You must be signed in to change notification settings - Fork 2.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
routing+htlcswitch: fix stuck inflight payments #9150
base: master
Are you sure you want to change the base?
routing+htlcswitch: fix stuck inflight payments #9150
Conversation
Important Review skippedAuto reviews are limited to specific labels. 🏷️ Labels to auto review (1)
Please check the settings in the CodeRabbit UI or the You can disable this status message by setting the Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media? 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi YY, made a quick pass through the change set. Looks pretty cool to me! Had a couple questions to make sure I understand the approach and also on whether the modification of GetAttemptResult is strictly necessary for this bug fix as it may have implications (albeit slight) on how the ChannelRouter can be used remotely.
routing/router.go
Outdated
@@ -123,7 +123,8 @@ type PaymentAttemptDispatcher interface { | |||
// longer be in flight. The switch shutting down is signaled by | |||
// closing the channel. If the attemptID is unknown, | |||
// ErrPaymentIDNotFound will be returned. | |||
GetAttemptResult(attemptID uint64, paymentHash lntypes.Hash, | |||
GetAttemptResult(attempt *channeldb.HTLCAttempt, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IIUC, this PR moves to transporting the hop + session key information across the interface boundary by way of channeldb.HTLCAttempt type. This is what allows the forwarding error decryption to be performed one layer deeper within the Switch rather than within the ChannelRouter. This may have implications for ability to support oblivious payments by providing optionality in server versus client side decryption of forwarding errors.
Maybe this is too future thinking to be relevant, but such an opportunity is a bit more natural when you consider alternative deployment scenarios in which ChannelRouter and Switch run remotely in different processes - perhaps both controlled by the same entity (eg: gRPC reverse proxy to make multiple nodes act as a single logical node) or controlled by different parties in the case where the client would like to build onions and send payments in a way that is private from a hosted node infrastructure provider.
Whether forwarding error decryption should be conducted server/Switch side or by the caller of GetAttemptResult could previously be signaled by the presence of the deobfuscator:
// extractResult uses the given deobfuscator to extract the payment result from
// the given network message.
func (s *Switch) extractResult(deobfuscator ErrorDecrypter, n *networkResult,
attemptID uint64, paymentHash lntypes.Hash) (*PaymentResult, error) {
...
// If the caller did not provide a deobfuscator, then we'll
// return the onion-encrypted blob that details why the HTLC was
// failed. This blob is only fully decryptable by the entity
// which built the onion packet.
if deobfuscator == nil {
return &PaymentResult{
EncryptedError: htlc.Reason,
}, nil
}
I could imagine a similar approach to optional Switch error decryption by constructing a HTLCAttempt object which does not set the fields needed to perform forwarding error decryption in the Switch, but the attempt.Circuit() call would have to allow that information not being there so the Switch could instead return the encrypted error blob via the result.
In our case, we control both the remote ChannelRouter and the lnd/Switch instances, so whether the forwarding error decryption occurs in the router or switch isn't a huge concern since privacy across the RPC boundary is not relevant.
From the perspective of a router client controlled separately from the entity operating the lnd/Switch instances: With the removal of any ChannelRouter error decryption logic, the client application could not re-use the ChannelRouter type, but it could at least in principle provide a different implementation of a path-finding, onion building, and payment life-cycle manager component that could do client-side decryption of forwarding errors.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the optimization on sphinx circuit creation needed to fix the bug? Do we gain much by way performance or more logical code separation by burying the decryption of errors within the Switch?
Maybe we like the improvements this brings better than the currently somewhat hypothetical ability to allow ChannelRouter re-use in clients seeking to maintain privacy from hosted node providers. One argument for keeping the error decryption logic in the ChannelRouter could be that it does create the onion after all so will still need to handle or know about the hops + session key.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure I followed - if you wanna create onions then you should use SendToOnions
?
Is the optimization on sphinx circuit creation needed to fix the bug? Do we gain much by way performance or more logical code separation by burying the decryption of errors within the Switch?
Nope it's not needed but an improvement - an easy way to measure it is to run the payment benchmark, which increases a bit. This is more of a logical code separation since there's no need to re-create the error decryptor if we know it's not a fail but a settle.
One argument for keeping the error decryption logic in the ChannelRouter could be that it does create the onion after all so will still need to handle or know about the hops + session key.
I don't think that's the case, all paymentLifecycle
needs is to call RequestRoute
and pass the route to the htlcswitch
.
Anyway I removed this change to limit the scope of this PR, think we'll take a look at it again when working on #8834.
result *htlcswitch.PaymentResult) bool { | ||
|
||
// Save the keys so we know which items to delete from the map. | ||
attempts = append(attempts, a) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we only mark an attempt result for deletion if it is actually successfully handled without error by the call to handleAttemptResult below? If a result is not deleted from the map, it will be reprocessed when this function is again called on the next life-cycle loop iteration. On one hand retries sound good, but in this context maybe that risks a perpetually growing result map if certain results are borked and never able to be processed successfully 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah I thought about this too - since the only error we get here is DB-related, I don't think retrying it again would help. This is also the current behavior. But I guess some logs would help so will add!
// Once the result is collected, we signal it by writing the | ||
// error to `resultCollected`. | ||
// Save the result and process it in the next main loop. | ||
p.switchResults.Store(attempt, result) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rather than update payment state in DB given an attempt result from collectResult, we instead defer any ControlTower/DB updates and simply collect and return the result so that we can update our DB state later from a centralized location. Am I following you correctly?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes exactly!
// means the payment lifecycle needs to be terminated. | ||
_, err := p.handleAttemptResult(a, result) | ||
if err != nil { | ||
errReturned = err |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there any issue with masking errors of payment attempt results if there are multiple results in this map with an error?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice catch - imo it's fine to do so here as all we want is to catch at least one of the DB errors and abort the payment lifecycle.
payment.GetStatus(), ps.NumAttemptsInFlight, | ||
ps.RemainingAmt, remainingFees) | ||
// We update the payment state on every iteration. | ||
currentPayment, ps, err := p.refreshPayment() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to handle any potential error from this call?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yep
abd6e94
to
930fadf
Compare
This commit caches the creation of sphinx circuit and onion blob to avoid re-creating them again.
To shorten the method `resumePayment` and make each step more clear.
To further shorten the lifecycle loop.
This commit refactors `collectResultAsync` such that this method is now only responsible for collecting results from the switch. A new method `processSwitchResults` is added to process these results in the same goroutine where we fetch the payment from db, to make sure the lifecycle loop always have a consistent view of a given payment.
930fadf
to
b631432
Compare
Fix #8975
This PR aims at fixing a scenario where the payment could be stuck. Major changes are,
resumePayment
)update_fail_htlc
.TODOs
htlcswitch
routing